High-throughput single-cell technologies facilitate the generation of -omic readouts from thousands of cells captured at different cellular maturation stages during development, or other normal or pathological processes with unprecedented resolution. A single snapshot of an asynchronously developing specimen, for example, constitutes a time series in which individual cells represent distinct time points along a continuum. However, recoding of valuable cell-specific information, such as a cell’s developmental age, its location in a tissue, or its functional phenotype is limited during sample preparation, and remains hidden in high dimensional cellular expression profiles. This formulates the computational challenge to infer the latent internal time axis of the biological process from the obtained expression matrix alone, while considering common parameters of single-cell measurements, such as noise, dropouts and redundancy. In other words, biological samples need to be placed by means of hidden information onto a non-linear trajectory, which might constitute of branching processes towards distinct functional cell types.
This manual describes the practical use of the CellTrails implementation, an unsupervised algorithm for the de novo chronological ordering, visualization and analysis of single-cell expression data. CellTrails makes use of a geometrically motivated concept of lower-dimensional manifold learning, which exhibits a multitude of virtues that counteract intrinsic noise of single cell data caused by drop-outs, technical variance, and redundancy of predictive variables. CellTrails enables the reconstruction of branching trajectories and provides an intuitive graphical representation of expression patterns along all branches simultaneously. It allows the user to define and infer the expression dynamics of individual and multiple pathways towards distinct phenotypes.
CellTrails was developed with a 183-dimensional RT-qPCR gene expression panel of 1,008 cells collected from the developing chicken utricle, a balance organ. Key players in the utricle’s function are cohorts of sensory hair cells that display mechanosensing organelles, called hair bundles, protruding from their apical surfaces. Bundle growth and maturation is dictated by an orchestration of distinct sequential and overlapping cellular processes. Our goal was to elucidate the temporal expression program of key hair bundle genes in subtypes of hair cells that occur with distinct spatial distribution. We showed that CellTrails faithfully predicted expression patterns of hair cell maturation with unprecedented resolution.
We confirmed that CellTrails can be applied to analysis of single-cell RNA-Seq datasets. We are pleased that you consider using CellTrails in your research. A detailed theoretical description of the algorithm and its application to biological uses has been published in:
Ellwanger DC, Scheibinger M, Dumont RA, Barr-Gillespie PG, and Heller S. “Transcriptional dynamics of hair-bundle morphogenesis revealed with CellTrails”. Journal_ Date;Issue.
CellTrails 0.99.0
The CellTrails package comes with an example dataset (of class ExpressionSet from the Biobase package), which can be loaded with the function data.
# Load example data
data("gga_e15_utricle")
This dataset contains transcript expression profiles of 183 genes expressed during sensory hair cell bundle maturation and function, which were quantified in the chicken utricle sensory epithelium at embryonic day 15 using multiplex RT-qPCR. Experimental metadata was generated during tissue preparation (cell origin) and cell sorting (uptake of FM1-43 dye indicating cell maturity). This data set is the foundation used for the development of CellTrails. If you use this dataset for your research, please cite:
Ellwanger DC, Scheibinger M, Dumont RA, Barr-Gillespie PG, and Heller S. “Transcriptional dynamics of hair-bundle morphogenesis revealed with CellTrails”. Journal Date;Issue. doi:
The following illustrates the typical sequence of steps performed during a CellTrails analysis. Shown is the signature of each function as provided in the CellTrails package documentation (call ?name_of_function):
## Not run:
# -------------------------------
# Container Creation
# -------------------------------
ctset <- as.CellTrailsSet(object) #either numeric matrix or ExpressionSet
# -------------------------------
# Feature Selection
# -------------------------------
# Filter features expressed in a minimum number of samples
ctset <- filterFeaturesByDL(ctset, threshold)
# Filter features having a minimum coefficient of variation
ctset <- filterFeaturesByCOV(ctset, threshold, design=NULL)
# Filter features having a significant high Fano factor
ctset <- filterFeaturesByFF(ctset, threshold=1.7, min_expr=0, design=NULL)
# -------------------------------
# Manifold Learning
# -------------------------------
# Spectral Embedding
ctset <- embedSamples(ctset, design=NULL)
# Dimensionality Reduction
ctspec <- findSpectrum(ctset, frac=100)
plot(ctspec)
ctset <- reduceDimensions(ctset, ctspec)
plot(ctset, type="latentSpace", feature_name, pheno_type, viz, perplexity=30,
seed=1101)
plot(ctset, type="latentSpace", pheno_type, viz, perplexity=30, seed=1101)
# -------------------------------
# Clustering
# -------------------------------
ctset <- findStates(ctset, min_size=0.01, min_feat=5, max_pval=1e-04, min_fc=2)
plot(ctspec, type="stateSize")
plot(ctset, type="stateExpression", feature_name)
# -------------------------------
# Sample Ordering
# -------------------------------
# State Trajectory Graph
ctset <- connectStates(ctset, l=10)
plot(ctset, type="stateTrajectoryGraph", feature_name, component, seed=1101)
plot(ctset, type="stateTrajectoryGraph", pheno_type, component, point_size=3,
label_offset=2, seed=1101)
ctset <- selectTrajectory(ctset, component)
# Alignment of Samples onto Trajectory
ctset <- fitTrajectory(ctset)
plot(ctset, type="trajectoryFit")
# -------------------------------
# CellTrails Maps
# -------------------------------
# Graph Layout
write.ygraphml(ctset, file, feature_name=NULL, label=NULL)
write.ygraphml(ctset, file, pheno_type=NULL, label=NULL)
ctmaps <- read.ygraphml(ctset, file, adjust=TRUE)
# Plot maps
plot(ctset, type="map", feature_name, map_type=c("full", "se", "single"))
plot(ctset, type="map", pheno_type)
# -------------------------------
# Analysis of Expression Dynamics
# -------------------------------
# Trail Identification
plot(ctset, type="trailblazing")
ctmaps <- addTrail(ctset, from, to, name)
plot(ctset, type="trail", trail_name)
# Inference of Dynamics
plot(ctset, type="dynamic", feature_name, trail_name)
fit <- fitDynamic(ctset, feature_name, trail_name)
# Dynamic Comparison: Within Trails
plot(ctset, type="dynamic", feature_name, trail_name)
fit1 <- fitDynamic(ctset, feature_name, trail_name)
fit2 <- fitDynamic(ctset, feature_name, trail_name)
cor(fit1$expression, fit2$expression)
# Dynamic Comparison: Between Trails
contrastTrailExpr(ctset, feature_name, trail_names, score)
## End(Not run)
In the following those steps will be illustrated in detail.
CellTrails organizes its data in a container called CellTrailsSet. This class extends the class ExpressionSet from the Biobase package (Huber et al. 2015) and provides all attributes required for smooth and user-friendly data processing. In the following, it is described how to initialize, manipulate and display an object of this class.
The class ExpressionSet is part of the Biobase package, which provides base functions for Bioconductor. It allows for intuitive organizing of quantitative expression data and experimental metadata. To make best use of CellTrails (and other Bioconductor packages), it is recommended to store individual datasets in a ExpressionSet object. For further details on this class, please refer to its vignette (Falcon, Morgan, and Gentleman 2007).
A CellTrailsSet can be initialized from an ExpressionSet object as follows:
class(gga_e15_utricle)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
# Initialize CellTrailsSet
ctset <- as.CellTrailsSet(gga_e15_utricle)
Please note, that function as.ExpressionSet enables to (back) coerce any CellTrailsSet object into an ExpressionSet.
The minimal requirement to initialize a CellTrailsSet object is a numerical matrix containing quantitative expression measurements. Here, features (e.g., genes or transcripts) are represented by rows and samples (e.g., single cells) are represented by columns. We recommend assigning meaningful row and column names.
# Expression matrix
edat <- exprs(gga_e15_utricle)
edat[1:5, 1:5]
## Cell-1-1 Cell-1-2 Cell-1-3 Cell-1-4 Cell-1-5
## ABCA5 9.703964 8.381090 0.000000 12.23502 10.20736
## ARF1 10.700009 12.265743 11.482489 12.14234 11.09284
## ATP6V1B2 11.985085 9.423574 9.734314 11.78410 11.56234
## CALB2 13.784440 0.000000 4.759918 14.06947 11.38170
## CIB2 5.260841 0.000000 6.756759 0.00000 0.00000
# Initialize CellTrailsSet
ctset <- as.CellTrailsSet(edat)
Since CellTrailsSet extends ExpressionSet, one can add further details about each feature (e.g., accession numbers, functional annotations, usf.) and sample (e.g., information collected during cell preparation and sorting) by using the functions featureData and phenoData, respectively.
# Sample information
sample_info <- pData(gga_e15_utricle)
sample_info[745:750, ]
## FM143 ORIGIN PANEL
## Cell-8-91 3.high <NA> 8
## Cell-8-92 3.high <NA> 8
## Cell-8-93 3.high <NA> 8
## Cell-9-1 <NA> lateral 9
## Cell-9-2 <NA> lateral 9
## Cell-9-3 <NA> lateral 9
# Feature information
feature_info <- fData(gga_e15_utricle)
feature_info[1:5, 1:3]
## ASSAY_ID ASSAY_NAME PRIMER_REFSEQ
## ABCA5 GEP00062062 ABCA5_62062_i34 XM_004946235.1
## ARF1 GEP00061884 ARF1_61884_i2 NM_001006352.1
## ATP6V1B2 GEP00061891 ATP6V1B2_61891_i11 XM_424534.4
## CALB2 GEP00058544 CALB2_58544_i3 NM_205316.1
## CIB2 GEP00061906 CIB2_61906_i2 XM_004943719.1
# Add metadata to CellTrailsSet
featureData(ctset) <- new("AnnotatedDataFrame", data=feature_info)
phenoData(ctset) <- new("AnnotatedDataFrame", data=sample_info)
All required manipulations to reconstruct a trajectory, and to extract and analyze expression dynamics are implicitly performed by the functions provided in this package. However, if needed, the data stored in a CellTrailsSet object can be directly accessed. Here are some examples on indexing a CellTrailsSet object:
# Access sample information
ctset$ORIGIN[1000:1005]
## [1] lateral lateral lateral lateral lateral medial
## Levels: lateral medial
# Access assay data
ctset["ABCA5", ][10:15]
## Cell-1-10 Cell-1-11 Cell-1-12 Cell-1-13 Cell-1-14 Cell-1-15
## 7.169828 11.843093 6.494406 6.457518 11.074259 11.253876
ctset[, 3][1:5]
## ABCA5 ARF1 ATP6V1B2 CALB2 CIB2
## 0.000000 11.482489 9.734314 4.759918 6.756759
For further details, please refer to the class documentation (call ?CellTrailsSet) and the next section on displaying a CellTrailsSet object.
A CellTrailsSet object can be inspected by using the functions show and plot.
By calling the variable name of the CellTrailsSet object, an informative overview is printed.
# Show object
ctset
## [[ CellTrailsSet ]]
## assayData: 183 features, 1008 samples
## element names: exprs
## phenoData:
## sampleNames: "Cell-1-1" "Cell-1-2" ... "Cell-11-82" (1008)
## varLabels: "FM143" "ORIGIN" "PANEL" (3)
## varMetadata: labelDescription
## featureData:
## featureNames: "ABCA5" "ARF1" ... "USH2A" (183)
## fvarLabels: "ASSAY_ID" "ASSAY_NAME" ... "PRIMER_EFFICIENCY" (7)
## fvarMetadata: labelDescription
## mapData:
## trajectoryFeatures: "ABCA5" "ARF1" ... "USH2A" (183)
## latentSpace: none
## stateTrajectoryGraph: none
## trajectoryStates: none
## trajectorySamples: none
## trajectoryFit: none
## trajectoryLayout: none
## trailData:
## trailNames: none
The assayData, featureData and phenoData sections correspond to information that is provided by ExpressionSet, while mapData and trailData denote CellTrailsSet specific details. All attributes as listed below can be accessed via the respective function.
mapData
trailData
There are several ways to visualize the data contained in a CellTrailsSet object using the function plot(object, type, ...). The function returns a ggplot object (Wickham 2009) from the ggplot2 package, which can be adapted by the user’s needs and preferences. Detailed information on each plot type is provided in the function’s documentation (call ?plot.CellTrailsSet). The available plot types are:
type
After setting up a CellTrailsSet object, CellTrails is ready to reconstruct the trajectory. An expression matrix from an single-cell RNA-Seq experiment contains hundreds of features that bear no or little information about a cell’s progress through the trajectory of interest. These non-relevant features not only increase the computational runtime of the manifold learning, but they also impair the accuracy of downstream analysis results. CellTrails assumes that key features show high expression variability during a cell’s progression along the biological process under study. Therefore, CellTrails enables to unbiasedly filter the informative features.
Please note, that the feature selection step is optional for single-cell RT-qPCR data, because genes were handpicked by an expert in its field and/or are based on literature evidence.
A freshly initialized CellTrailsSet instance assumes that all features are used for trajectory reconstruction:
# Generate random scRNA-Seq example data
# for 15,000 genes and 100 cells
sim <- simulate_exprs(n_features=15000, n_samples=100)
# Container Creation
ctset_sim <- as.CellTrailsSet(sim)
# Dimensions of expression matrix
dim(ctset_sim)
## Features Samples
## 15000 100
# Number of trajectory features
length(trajectoryFeatures(ctset_sim))
## [1] 15000
Please note, that the following functions only filter the features used for dimensionality reduction, state detection and trajectory reconstruction; all quantified features are still available for donwstream analyses, such as cluster analysis and inference of gene expression dynamics.
This function determines trajectory features that are present in a disproportionate small number of samples. It removes features that are not expressed or that do not sufficiently reach the technological limit of detection. The detection level is defined as the fraction of samples in which a feature is detected, i.e. the relative number of samples having a feature’s expression value greater than 0. If a threshold >= 1 is selected, its value is automatically converted to a relative fraction of the total sample count. The empirical cumulative distribution function of all samples and the fraction of removed features is shown.
# Filter features expressed in a minimum number of samples
ctset_sim <- filterFeaturesByDL(ctset_sim, threshold=2)
# Dimensions of expression matrix
dim(ctset_sim)
## Features Samples
## 15000 100
# Number of trajectory features
length(trajectoryFeatures(ctset_sim))
## [1] 9896
This function determines trajectory features with a narrow standard deviation (sd) with respect to its average expression (mean). This filter removes features with high expression and low variance, such as housekeeping genes. The coefficient of variation is computed by CoV(x) = sd(x)/mean(x). Features with a CoV(x) greater than a given threshold remain labeled as trajectory feature in the CellTrailsSet object. The empirical cumulative distribution function of all samples and the fraction of removed features is shown.
# Filter features having a minimum coefficient of variation
ctset2 <- filterFeaturesByCOV(ctset_sim, threshold=.5)
# Dimensions of expression matrix
dim(ctset_sim)
## Features Samples
## 15000 100
# Number of trajectory features
length(trajectoryFeatures(ctset_sim))
## [1] 9896
Please note, that the function allows to block nuisance factors, such as batch, gender or cell cycle effects, for the filtering of proper features. Please refer to section MANIFOLD LEARNING/BLOCKING UNINFORMATIVE SUBSTRUCTURES to see an example on how to set the parameter design, respectively.
This function identifies the most variable features while considering their average expression level. Features are placed into 20 bins based on their mean expression. For each bin, the distribution of Fano factors, which is a windowed version of the index of dispersion (IOD = variance / mean), is computed and standardized (Z-score(x) = x/sd(x) - mean(x)/sd(x)). Highly variable features with a Z-score greater than a given threshold remain labeled as trajectory feature in the CellTrailsSet object. The parameter min_expr defines the minimum average expression level of a feature to be considered for this filter. The Fano factor and the average expression is shown for each feature; filtered features are highlighted.
# Filter features having a significant high Fano factor
ctset_sim <- filterFeaturesByFF(ctset_sim, threshold=1.7, min_expr=0)
# Dimensions of expression matrix
dim(ctset_sim)
## Features Samples
## 15000 100
# Number of trajectory features
length(trajectoryFeatures(ctset_sim))
## [1] 598
Please note, that the function allows to block nuisance factors, such as batch, gender or cell cycle effects, for the filtering of proper features. Please refer to section MANIFOLD LEARNING/BLOCKING UNINFORMATIVE SUBSTRUCTURES to see an example on how to set the parameter design, respectively.
The samples’ expression profiles are shaped by many factors, such as developmental age, tissue region of origin, cell cycle stage, as well as extrinsic sources such as status of signaling receptors, and environmental stressors, but also technical noise. In other words, a single dimension, despite just containing feature expression information, represents an underlying combination of multiple dependent and independent, relevant and non-relevant factors, whereat each factor’s individual contribution is non-uniform. To obtain a better resolution and to extract underlying information, CellTrails aims to find a meaningful low-dimensional structure - a manifold - that represents samples mainly by their latent temporal relation.
CellTrails aims to decipher the temporal relation between samples by computing a novel data representation which amplifies trajectory information in its first n dimensions. For this purpose, CellTrails employs spectral graph theory. Due to their locality-preserving character, spectral embedding techniques are advantageous because these consider the data’s manifold structure, are insensitive to outliers and noise, are not susceptible to short-circuiting, and emphasize naturally occurring clusters in the data (Belkin and Niyogi 2003; Sussman et al. 2012). In a nutshell, CellTrails assumes that two samples that have a high statistical dependency are represented in close proximity along a trajectory. CellTrails captures the intrinsic data geometry as a weighted graph (nodes = samples, edges = statistical dependencies between pairs of samples) by means of fuzzy mutual information and uses spectral graph decomposition to unfold the manifold revealing the hidden trajectory information.
The spectral embedding is performed using the function embedSamples and results in an eigenspace representation of the original expression data.
# Required steps upstream
# ctset <- as.CellTrailsSet(gga_e15_utricle) # Create container
# Spectral Embedding
ctset <- embedSamples(ctset)
## Computing adjacency matrix ...
## Computing spectral embedding ...
dim(latentSpace(ctset))
## [1] 1008 1008
CellTrails assumes that the expression vectors are lying on or near a manifold with a low dimensionality that is embedded in the higher-dimensional space. The number of dimensions can be reduced, which lowers noise (i.e. truncates non-relevant dimensions), while the geometry of the trajectory is emphasized.
The function findSpectrum helps to identify the intrisic dimensionality of the data. It determines the most informative dimensions based on the eigenvalues (spectrum) of the eigenspace. Components of the latent space are ranked by their information content. In the following example, CellTrails identifies 9 relevant components for the E15 chicken utricle dataset using a linear fit on the eigengaps of the first 100 eigenvalues, as can be visualized by the respective plot function.
# Required steps upstream
# ctset <- as.CellTrailsSet(gga_e15_utricle) # Create container
# ctset <- embedSamples(ctset) # Embed samples
# Dimensionality Reduction
ctspec <- findSpectrum(ctset, frac=100)
plot(ctspec)
ctset <- reduceDimensions(ctset, ctspec)
dim(latentSpace(ctset))
## [1] 1008 9
We suggest to assess the resulting Scree plot (eigengaps versus spectrum size) for whether the estimation of the unknown intrinsic dimensionality was reasonable. Otherwise, we recommend to adjust the parameter frac accordingly.
Please note, that considering too few components of the latent space may result in loss of information, while selecting lower ranked components could increase noise.
CellTrails allows us to visualize the truncated latent space. CellTrails’ plot function with the parameter type="latentSpace" uses t-distributed stochastic neighbor embedding (van der Maaten and Hinton 2008) to illustrate the arrangement of the samples in the (in this example: 9-dimensional) latent space in a two-dimensional plot. To reduce computational time, a previously computed tSNE can be re-used. Points denote individual cells, the colorization indicates either a metadata label (here: FM1-43 dye uptake) or expression of a single feature (here: gene MYO7A).
Please note, that empty points denote a missing label or missing expression value (non-detects). The optional parameters seed and perplexity are passed to the Barnes-Hut implementation of t-distributed stochastic neighbor embedding.
# Plot latent space w/ FM1-34 uptake
gp <- plot(ctset, type="latentSpace", pheno_type="FM143", seed=1101,
perplexity=30)
## Computing 2D visualization of manifold ...
gp
# Plot latent space w/ MYO7A
# Re-use previous tSNE result to reduce computation time
# Parameter viz contains previous tSNE result
plot(ctset, type="latentSpace", feature_name="MYO7A", viz=gp)
Single-cell measurements are susceptible to the influence of confounders, such as batch, gender or cell cycle effects. Blocking these nuisance factors during manifold learning may be necessary to significantly improve the result of downstream data analyses, such as reconstruction of the temporal trajectory. Therefore, the function embedSamples can account for confounding effects via the parameter design, as will be demonstrated on the example of single-cell RNA-Seq data of murine T helper 2 cell (Th2) differentiation (Mahata et al. 2014). In a nutshell, Buettner et al. (F. Buettner et al. 2015) identified cell cycle effects as major confounder in this dataset and applied a single-cell latent variable model (scLVM) approach to account for this factor. They unbiasedly identified then two cell populations, namely a group of partially and a group of fully differentiated cells. The normalized, log transformed and filtered scRNA-Seq data can be obtained from the supplementary materials of this article (Table S5 and S7); further, a curated list of Th2 marker genes, the scLVM ‘corrected’ expression matrix, and a binary cluster assignment for each cell can be downloaded.
Let’s assume the numeric expression matrix (here called th2) and the list of marker genes are already organized in a Biobase::ExpressionSet container. Here, the expression matrix consists of 7,063 genes (116 of which are marker genes) which have been detected in more than 2 of all 81 cells.
th2 <- readRDS(system.file("exdata", "th2.rds", package="CellTrails"))
th2
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 7063 features, 81 samples
## element names: exprs
## protocolData: none
## phenoData: none
## featureData
## featureNames: Gnai3 Cdc45 ... X4933404O12Rik (7063 total)
## fvarLabels: isMarker ENSEMBL
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 24813893 25599176
## Annotation: Single-cell RNA-Seq data of murine T helper 2 cell (Th2) differentiation
# Number of markers
nMarkers <- sum(fData(th2)$isMarker)
nMarkers
## [1] 116
# Number of total genes
nGenes <- nrow(th2)
nGenes
## Features
## 7063
First, we have a quick look into the unprocessed dataset. If the latent temporal factor is a major source of variance, two clusters, which separate fully from partially differentiated cells, should be detectable; if those clusters are not identifiable, the data is affected by uniformative substructures. We assume that Th2 marker genes should be enriched in the group of genes differentially expressed between clusters, i.e. the enrichment odds ratio should be > 1 and the enrichment P-value should be significant if cells were clustered by maturity.
# Clustering in the original space
D <- dist(t(exprs(th2)))
dendro <- hclust(D, method="ward.D2")
cluster <- cutree(dendro, k=2)
# Differential expression
pvals <- apply(exprs(th2), 1, function(x) {
wilcox.test(x[cluster == 1],
x[cluster == 2],
exact=FALSE)$p.value})
fdr <- p.adjust(pvals, method="fdr")
# Number of differentially expressed markers for FDR < 0.05
de <- names(fdr[fdr < 0.05]) #differentially expressed genes
deGenes <- length(de) #number of genes
deMarkers <- sum(fData(th2[de, ])$isMarker) #number of markers
# Enrichment statistic
enrichment.test(deMarkers, nMarkers, deGenes, nGenes)
## $p.value
## [1] 0.989218
##
## $odds.ratio
## odds ratio
## 0.6526187
##
## $conf.int
## [1] 0.4687965 Inf
## attr(,"conf.level")
## [1] 0.95
##
## $method
## [1] "Fisher's exact test for enrichment"
Since the enrichment is not significant (with an odds ratio < 1), we argue that cells were not properly separated by maturity in the original space.
To block the cell cycle effects, CellTrails expects a design matrix modeling the cell cycle stage as the explanatory factor for each cell. As the cell-cycle stage of each cell is not known in this data set, we need to predict cell cycle phases. In this example, we use the classifier cyclon from the scran package (Lun, McCarthy, and Marioni 2016). To be able to run the algrithm properly, gene symbols were translated to Ensembl identifiers using Bioconductors’ annotation database interface package AnnotationDbi (Pages et al. 2017) and the mouse annotation data package org.Mm.eg.db (Carlson 2017).
Please note, that these packages are not part of CellTrails and may needed to be installed first (see footnotes for details).
# Run cyclone
mcm <- readRDS(system.file("exdata", "mouse_cycle_markers.rds",
package="scran"))
set.seed(1101)
cellCycle <- scran::cyclone(x=exprs(th2),
pairs=mcm,
gene.names=fData(th2)$ENSEMBL)
# Number of predicted phases
table(cellCycle$phases)
##
## G1 G2M S
## 59 14 8
Let’s create the respective design matrix using the cyclon classification scores.
# Design matrix
cc_design <- model.matrix(~ cellCycle$scores$G1 + cellCycle$scores$G2M)
head(cc_design)
## (Intercept) cellCycle$scores$G1 cellCycle$scores$G2M
## 1 1 0.108 0.701
## 2 1 1.000 0.002
## 3 1 0.971 0.009
## 4 1 1.000 0.000
## 5 1 1.000 0.000
## 6 1 0.671 0.338
Next, we reduce the dimensionality using CellTrails. Passing the design matrix to embedSamples ensures that CellTrails properly regresses out the effects of the explanatory variables before learning the manifold. Then, we cluster the cells in the derived lower-dimensional space.
# Perform Dimensionality Reduction with Design Matrix
th2_ctset <- as.CellTrailsSet(th2)
th2_ctset <- embedSamples(th2_ctset, design=cc_design)
## Blocking nuisance factors ...
## Computing adjacency matrix ...
## Computing spectral embedding ...
th2_ctset <- reduceDimensions(ctset=th2_ctset,
ctspec=findSpectrum(th2_ctset))
# Clustering in Latent Space
D <- dist(latentSpace(th2_ctset))
dendro <- hclust(D, method="ward.D2")
cluster <- cutree(dendro, k=2)
We test the quality of clustering by quantifying the enrichment of marker genes in the set of differentially expressed genes.
# Differential expression
pvals <- apply(exprs(th2), 1, function(x) {
wilcox.test(x[cluster == 1],
x[cluster == 2],
exact=FALSE)$p.value})
fdr <- p.adjust(pvals, method="fdr")
# Number of differentially expressed markers for FDR < 0.05
de <- names(fdr[fdr < 0.05]) #differentially expressed genes
deGenes <- length(de) #number of genes
deMarkers <- sum(fData(th2[de, ])$isMarker) #number of markers
# Enrichment statistic
enrichment.test(deMarkers, nMarkers, deGenes, nGenes)
## $p.value
## [1] 2.464631e-06
##
## $odds.ratio
## odds ratio
## 6.887097
##
## $conf.int
## [1] 3.661379 Inf
## attr(,"conf.level")
## [1] 0.95
##
## $method
## [1] "Fisher's exact test for enrichment"
The marker gene enrichment is significant (P-value < 10-5) and the odds ratio is remarkably increased to ~7, indicating that the cells are now properly separated by maturity. In comparison, an enrichment odds ratio of 2.4 was achieved using the cell-cycle ‘corrected’ data and the clustering provided in the original scLVM study (F. Buettner et al. 2015).
Please note, that the differential gene expression analysis using the CellTrails derived clusters was performed on the actual expression matrix and not the cell-cycle ‘corrected’ expression values. In contrast to scLVM, CellTrails blocks the nuisance variables for manifold learning only and keeps the original expression values for downstream analysis. This is due to the fact that the manipulated expression matrix does not represent the actual transcript levels measured in each cell, nor does it account for the uncertainty of estimation of the blocking factor terms. By this means, CellTrails protects against confounding effects without discarding information.
Besides cell cycle, technical confounders may also be relevant to be accounted for. Those can occur, for example, if cells were processed on different plates or if cells were pooled from multiple sequencing runs. In this case, a design matrix with the respective explanatory variables can be constructed and passed to embedSamples.
If the user prefers to use an alternative approach for dimensionality reduction, the latent space can be set to a CellTrailsSet object with latentSpace<-. The latent space has to be a numerical matrix; rows represent samples and columns the components of the latent space. CellTrails uses by default spectral embedding, but the framework also operates well with any other spectral dimensionality reduction method, such as PCA (available in CellTrails via function pca) and diffusion maps (available via the destiny package (Angerer et al. 2015)):
# Create CellTrailsSet object
ctset2 <- as.CellTrailsSet(gga_e15_utricle)
# Perform dimensionality reduction
pca_result <- pca(ctset2)
## Performing PCA ...
dmaps_result <- destiny::DiffusionMap(t(exprs(ctset2)), n_eigs=110)
# EITHER
# Set PCA result to CellTrailsSet object
latentSpace(ctset2) <- pca_result$princomp
eigenvalues(ctset2) <- pca_result$variance
# OR
# Set diffusion map result to CellTrailsSet object
latentSpace(ctset2) <- dmaps_result@eigenvectors
eigenvalues(ctset2) <- dmaps_result@eigenvalues
# Continue with CellTrails workflow
# Determine intrinsic dimensionality
ctspec2 <- findSpectrum(ctset2)
plot(ctspec2)
# Reduce dimensionality
ctset2 <- reduceDimensions(ctset2, ctspec2)
ctset2
## [[ CellTrailsSet ]]
## assayData: 183 features, 1008 samples
## element names: exprs
## phenoData:
## sampleNames: "Cell-1-1" "Cell-1-2" ... "Cell-11-82" (1008)
## varLabels: "FM143" "ORIGIN" "PANEL" (3)
## varMetadata: labelDescription
## featureData:
## featureNames: "ABCA5" "ARF1" ... "USH2A" (183)
## fvarLabels: "ASSAY_ID" "ASSAY_NAME" ... "PRIMER_EFFICIENCY" (7)
## fvarMetadata: labelDescription
## mapData:
## trajectoryFeatures: "ABCA5" "ARF1" ... "USH2A" (183)
## latentSpace: 1008 samples, 11 components
## stateTrajectoryGraph: none
## trajectoryStates: none
## trajectorySamples: none
## trajectoryFit: none
## trajectoryLayout: none
## trailData:
## trailNames: none
Please note, that the function latentSpace<- accepts any numerical matrix. Therefore, any latent space with an already reduced number of dimensions can be assigned to a CellTrailsSet object with this function; eigenvalues are only required to determine the intrinsic dimensionality of the data set.
To identify cellular subpopulations, CellTrails performs hierarchical clustering via minimization of a square error criterion (Ward 1963) in the lower-dimensional space. To determine the number of clusters, CellTrails conducts an unsupervised post-hoc analysis. Here, it is assumed that differential expression of assayed features determines distinct cellular stages. Hierarchical clustering generates a cluster dendrogram. CellTrails makes use of this information and identifies the maximal fragmentation of the data space, i.e. the lowest cutting height in the clustering dendrogram that ensures that the resulting clusters contain at least a certain fraction of samples. Then, processing from this height towards the root, CellTrails iteratively joins siblings if they do not have at least a certain number of differentially expressed features. Statistical significance is tested by means of a two-sample non-parametric linear rank test accounting for censored values (R. Peto and Peto 1972). The null hypothesis is rejected using the Benjamini-Hochberg (Benjamini and Hochberg 1995) procedure for a given significance level. The number of clusters can impact the outcome of the trajectory reconstruction and, therefore, this step might require some parameter tuning depending on the input data (for more information on the parameters call ?findStates).
CellTrails identified 11 clusters in the E15 chicken utricle dataset, which are referred to as cellular states along the trajectory.
# Required steps upstream
# ctset <- as.CellTrailsSet(gga_e15_utricle) # Create container
# ctset <- embedSamples(ctset) # Embed samples
# ctset <- reduceDimensions(ctset, findSpectrum(ctset)) # Reduce dimensionality
# Clustering
ctset <- findStates(ctset, min_size=0.01, min_feat=5, max_pval=1e-4, min_fc=2)
## Initialized 25 clusters with a minimum size of 10 samples each.
## Performing post-hoc test ...
## Found 11 states.
The cluster labels can be accessed with the function states(ctset). They are also available for all plot types providing the parameter pheno_type, e.g.:
plot(ctset, type="latentSpace", pheno_type="state")
## Computing 2D visualization of manifold ...
By having a look at the size of those states, one can see that state S4 clearly harbors the largest fraction of cells.
plot(ctset, type="stateSize")
A major cell population in the chicken utricle are the supporting cells and, therefore, it can be hypothesized that state S4 harbors this cell type. CellTrails enables to analyze differential expression between cluters. Let’s have a look at the expression levels of known supporting cell markers per state to corroborate our hypothesis:
plot(ctset, type="stateExpression", feature_name="TECTA")
plot(ctset, type="stateExpression", feature_name="TECTB")
plot(ctset, type="stateExpression", feature_name="OTOA")
The violin plots (vertically mirrored density curves) indicate that state S4 harbors cells that express supporting cell markers. Individual points represents expression levels of single cells.
CellTrails assumes that the arrangement of samples in the computed latent space constitutes a trajectory. CellTrails aims to place single samples along a maximum parsimony tree, which resembles a branching developmental continuum. Distances between samples in the latent space are computed using the Euclidean metric.
To avoid overfitting and to facilitate the accurate identification of bifurcations, CellTrails simplifies the problem. Analogous to the idea of a ‘broken-stick regression’, CellTrails groups the data and performs linear fits to separate trajectory segments, which are determined by the branching chronology of states. This leaves the optimization problem of finding the minimum number of associations between states that maximize the total parsimony. In theory, this problem can be solved by any minimum spanning tree algorithm. CellTrails adapts this concept by assuming that adjacent states should be located nearby and, therefore, share a relative high number of neighboring cells.
Each state defines a submatrix of samples that is composed of a distinct set of data vectors, i.e. each state is a distinct set of samples represented in the lower-dimensional space. For each state, CellTrails identifies the l-nearest neighbors to each state’s data vector and takes note of their state memberships and distances. This results in two vectors of length l times the state size. Subsequently, CellTrails removes spurious neighbors (outliers/false-positive neighbors), which are statistically too distal. For each state CellTrails calculates the relative frequency on how often a state occurs in the neighborhood of a given state, which is referred to as the interface cardinality scores.
CellTrails implements a greedy algorithm to find the tree maximizing the total interface cardinality score, similar to Kruskal’s minimum spanning tree algorithm (Kruskal 1956). The graph construction has a relaxed requirement (number of edges < number of nodes) compared to a tree (number of edges = number of nodes - 1), which may result in a graph having multiple tree components (= forest) indicating potentially independent trajectories or isolated nodes.
Please note, that the function connectStates can be adjusted, such that the resulting number of components may be lower or higher by increasing or decreasing the parameter l, respectively.
In our E15 chicken utricle dataset, we identified two components, as indicated by the stateTrajectoryGraph attribute of the CellTrailsSet object: one tree with 10 states connected by 9 edges, and one isolated state.
# Required steps upstream
# ctset <- as.CellTrailsSet(gga_e15_utricle) # Create container
# ctset <- embedSamples(ctset) # Embed samples
# ctset <- reduceDimensions(ctset, findSpectrum(ctset)) # Reduce dimensionality
# ctset <- findStates(ctset) # Find states
# State Trajectory Graph Computation
ctset <- connectStates(ctset, l=10)
ctset
## [[ CellTrailsSet ]]
## assayData: 183 features, 1008 samples
## element names: exprs
## phenoData:
## sampleNames: "Cell-1-1" "Cell-1-2" ... "Cell-11-82" (1008)
## varLabels: "FM143" "ORIGIN" ... "STATE" (4)
## varMetadata: labelDescription
## featureData:
## featureNames: "ABCA5" "ARF1" ... "USH2A" (183)
## fvarLabels: "ASSAY_ID" "ASSAY_NAME" ... "PRIMER_EFFICIENCY" (7)
## fvarMetadata: labelDescription
## mapData:
## trajectoryFeatures: "ABCA5" "ARF1" ... "USH2A" (183)
## latentSpace: 1008 samples, 9 components
## stateTrajectoryGraph: [Component(#Vertices,Edges)]: 1(10,9) 2(1,0)
## trajectoryStates: none
## trajectorySamples: none
## trajectoryFit: none
## trajectoryLayout: none
## trailData:
## trailNames: none
The CellTrails plot function enables the visualization of the inferred state trajectory graph. If the graph is a forest, the parameter component can be used to define which tree should be shown. Let’s have a look at the FM1-43 uptake and the CALB2 expression in component 1:
plot(ctset, type="stateTrajectoryGraph", component=1, pheno_type="FM143",
point_size=1.5, label_offset=4, seed=1101)
plot(ctset, type="stateTrajectoryGraph", component=1, feature_name="CALB2",
point_size=3, label_offset=2, seed=1101)
The optional parameters seed, point_size and label_offset are useful to adjust the graph layout, the size of the points and the relative position of the point labels, respectively.
For the sake of simplicity and performance, it makes sense to conduct subsequent steps for each component individually. In this case, we select the tree formed by graph component 1 with 896 cells (listed as trajectorySamples) for our continuing E15 chicken utricle analysis.
# Select Component
ctset <- selectTrajectory(ctset, component=1)
ctset
## [[ CellTrailsSet ]]
## assayData: 183 features, 1008 samples
## element names: exprs
## phenoData:
## sampleNames: "Cell-1-1" "Cell-1-2" ... "Cell-11-82" (1008)
## varLabels: "FM143" "ORIGIN" ... "STATE" (4)
## varMetadata: labelDescription
## featureData:
## featureNames: "ABCA5" "ARF1" ... "USH2A" (183)
## fvarLabels: "ASSAY_ID" "ASSAY_NAME" ... "PRIMER_EFFICIENCY" (7)
## fvarMetadata: labelDescription
## mapData:
## trajectoryFeatures: "ABCA5" "ARF1" ... "USH2A" (183)
## latentSpace: 1008 samples, 9 components
## stateTrajectoryGraph: [Component(#Vertices,Edges)]: 1(10,9)
## trajectoryStates: "S1" "S2" ... "S11" (10)
## trajectorySamples: "Cell-1-1" "Cell-1-2" ... "Cell-11-82" (896)
## trajectoryFit: none
## trajectoryLayout: none
## trailData:
## trailNames: none
The selected component (consisting of 10 states) defines the trajectory backbone. The function fitTrajectory embeds the trajectory structure in the latent space by computing straight lines passing through the mediancentres (Bedall and Zimmermann 1979) of adjacent states. Then, a fitting function is learned. Each sample is projected to its most proximal straight line passing through the mediancentre of its assigned state. Here, whenever possible, orthogonal projections onto line segments between two mediancentres are preferred to line segments which are only incident to a single median centre. Fitting deviations are given by the Euclidean distance between the sample’s location and the straight line, and are indicated by an aggregated statistic (Mean Squared Error, MSE) shown in the CellTrailsSet object. Finally, a weighted acyclic trajectory graph can be constructed based on each sample’s position along its straight line. Nodes in this graph are samples; edges are constructed between neighboring samples. Each edge is weighted by the distance between its nodes along the straight line.
Of note, the fitting function implies potential side branches in the trajectory graph; those could be caused due to technical variance or encompass samples that were statistically indistinguishable from the main trajectory given the selected genes used for trajectory reconstruction. The plot function shows the trajectory backbone (longest shortest path between two samples) and the fitting deviations of each sample indicated by the perpendicular jitter; the parameter rev allows the user to inverse the visualization.
ctset <- fitTrajectory(ctset)
plot(ctset, type="trajectoryFit", rev=TRUE)
CellTrails portrays a computed trajectory as collection of trails that can be found in a landscape shaped by individual gene expression dynamics. To generate such a topographic trail map - the CellTrails map - a two-dimensional spatiotemporal ordination of the expression matrix has to be computed. This can be done by any graph layout algorithm using the structural information from the trajectory graph, which is composed of nodes (=samples) and edges (=chronological relation between samples). We found that the freely available graph visualization software yEd (http://www.yworks.com/products/yed) has great capabilities to visualize and analyze a trajectory graph. An optimal layout is planar, i.e. it exhibits no crossing edges or overlapping nodes.
Therefore, CellTrails enables to export and import the trajectory graph structure using the common graphml file format. This file format can be interpreted by most third-party graph analysis applications, allowing the user to subject the trajectory graph to a wide range of (tree) layout algorithms. In particular, the exported file has additional ygraph attributes best suited to be used with yEd, which is freely available for all major platforms (Windows, Mac OS, and Linux).
# Required steps upstream
# ctset <- as.CellTrailsSet(gga_e15_utricle) # Create container
# ctset <- embedSamples(ctset) # Embed samples
# ctset <- reduceDimensions(ctset, findSpectrum(ctset)) # Reduce dimensionality
# ctset <- findStates(ctset) # Find states
# ctset <- connectStates(ctset) # Connect states
# ctset <- selectTrajectory(ctset, component=1) # Select trajectory
# ctset <- fitTrajectory(ctset) # Align samples to trajectory
# Trajectory Graph Layout: Export
write.ygraphml(ctset, file='your/file/path/yourFileName.graphml')
Let’s open the exported graphml file in yEd:
If a layout has already been defined for a CellTrailsSet object, the samples’ coordinates will be listed in the exported graphml file and will be directly interpreted by yEd. In this example, no layout was defined yet and therefore all samples (nodes) are overlapping.
Next, we layout the graph. Since the trajectory resembles a tree structre, we use a tree algorithm. We found that routing the trajectory graph in a quasi-radial style, called balloon style, works very well for our case. The layouter can be selected via Layout > Tree > Balloon. The following parameter setting was used in the original CellTrails publication:
Let’s run the algorithm to compute the layout:
Please note, that edge crossings (i.e. two or more edges overlap) are not useful and if they occur we sugget to re-run the layouter with different parameters before saving the layout. Finally, the file can be saved: File > Save and reimported to the CellTrailsSet object.
# Trajectory Graph Layout: Import
ctset <- read.ygraphml(ctset,
file='your/file/path/yourFileName.graphml',
adjust=TRUE)
The parameter adjust indicates if edge lengths need to be adjusted, such that they correspond to the inferred pseudotime.
The export function write.ygraphml colorizes automatically nodes by their assigned state. However, it is possible to colorize nodes by any phenotype or feature expression data stored in the CellTrailsSet object. For example, we could colorize the nodes by the recorded FM1-43 dye intensity to get an idea where the trajectory might start and end (a high FM1-43 dye indicates mature cells).
Please note, that nodes with a missing phenotype information are not colorized.
# Trajectory Graph Layout: Export
write.ygraphml(ctset, file='your/file/path/yourFileName.graphml',
pheno_type="FM143")
Let’s open the colorized trajectory graph again in yEd.
Now, we can analyze the distances between samples, which represent the passed pseudotime, and the FM1-43 uptake in detail. Using either your mouse or the View > Zoom In option allows to have a closer look. One can recognize that cells with high FM1-43 uptake are located on the bottom left of the layout. If we want to have the trajectory progressing from bottom left to bottom right, we need to transform the graph. This can be done via Tools > Geometric transformations. Here, we select Mirror on Y axis and Mirror on X axis.
We save the file and import it to CellTrails again. This time we set the parameter adjust to FALSE, since the pseudotime information has been already integrated.
# Trajectory Graph Layout: Import
ctset <- read.ygraphml(ctset,
file='your/file/path/yourFileName.graphml',
adjust=FALSE)
Sidenote. It is not a requirement to use yEd. The trajectory layout can also be defined by the user otherwise. The minimum requirement is, that the coordinates of each sample are stored in a data.frame whose row names correspond to the sample names. The sampleNames can be pulled from the CellTrailsSet object using the function sampleNames. The layout can then be set using the accessor function trajectoryLayout. For this tutorial, the CellTrails package comes with a list named trajLayouts which contains processed layouts for each dataset included in this package.
# Load layout example
data(trajLayouts)
head(trajLayouts$gga_e15_utricle)
## D1 D2
## Cell-1-1 13460.515 -11959.3933
## Cell-1-2 5888.370 -22186.9484
## Cell-1-3 1362.573 -14050.0538
## Cell-1-4 14755.370 -8483.2582
## Cell-1-5 8817.691 -482.6183
## Cell-1-6 15634.970 -14445.4859
par(mar=c(0,0,0,0))
plot(trajLayouts$gga_e15_utricle, pch=20, cex=.25, axes=FALSE)
# Set individual layout and adjust it
trajectoryLayout(ctset, adjust=TRUE) <- trajLayouts$gga_e15_utricle
par(mar=c(0,0,0,0))
plot(trajectoryLayout(ctset), pch=20, cex=.25, axes=FALSE)
After generating the layout, a two-dimensional visualization of the trajectory can be drawn. Here, the line represents the trajectory and individual points represent the samples. This plot type either colorizes individual samples by a metadata label (parameter pheno_type) or it shows the topography of a given feature’s expression landscape (defined by parameter feature_name). When metadata are being visualized, the grey line represents the trajectory and the individual points represent the cells. Cells that do not have a specific metadata label or value are not shown. Let’s visualize how the cellular FM1-43 uptake distributes along the trajectory:
plot(ctset, type="map", pheno_type="FM143")
In the topographical plot, a smooth surface is fitted and values are predicted for a regular grid resulting in the shown map topography; the red line signifies the trajectory. Let’s take a view into the ATOH1 expression landscape, an early key transcription factor during sensory hair cell development:
plot(ctset, type="map", feature_name="ATOH1")
Considering that the trajectory towards mature hair cells starts on the lower left, we can appreciate that ATOH1 exhibits a transient expression peak before the trajectory bifurcates into more specialized cellular phenotypes (on the upper left and upper right side of the map).
Let’s plot TMC2, which was found specifically expressed in one type of hair cells:
plot(ctset, type="map", feature_name="TMC2")
We can also visualize the standard error of the predicted expression surface using the parameter map_type.
plot(ctset, type="map", feature_name="TMC2", map_type="se")
Please note, that it is also possible to show the expression of single samples only. This can be achieved by setting the parameter map_type to single, respectively.
plot(ctset, type="map", feature_name="TMC2", map_type="single")
If needed this plot can also be exported to yEd using setting the parameter pheno_type of the export function, respectively. We require that the labels of the nodes are the sample names by setting the parameter label, respectively.
# Trajectory Graph Layout: Export
write.ygraphml(ctset,
file='your/file/path/yourFileName.graphml',
feature_name="TMC2",
label="NAME")
In this part of the tutorial, CellTrails is used to infer expression dynamics of features along an individual trail. A trail denotes a path between two landmarks, namely branching points (B) and end points. The latter are referred to as trail heads (H). For this purpose, CellTrails implements a convenient plot function (with parameter type=trailblazing), which displays the available landmark points on the trajectory map:
# Required steps upstream
# ctset <- as.CellTrailsSet(gga_e15_utricle) # Create container
# ctset <- embedSamples(ctset) # Embed samples
# ctset <- reduceDimensions(ctset, findSpectrum(ctset)) # Reduce dimensionality
# ctset <- findStates(ctset) # Find states
# ctset <- connectStates(ctset) # Connect states
# ctset <- selectTrajectory(ctset, component=1) # Select trajectory
# ctset <- fitTrajectory(ctset) # Align samples to trajectory
# trajectoryLayout(ctset, adjust=TRUE) <- trajLayouts$gga_e15_utricle # Layout
# Trail Identification
plot(ctset, type="trailblazing")
Based on the experimental meta information and the expression pattern of marker genes, we identified path B3 to H9 as developmental trail toward a striolar sensory hair bundle morphology, and B3 to H1 as developmental trail toward an extrastriolar bundle morphology. Let’s mark those trails on the map using the function addTrail.
ctset <- addTrail(ctset, from="B3", to="H9", name="TrS")
ctset <- addTrail(ctset, from="B3", to="H1", name="TrES")
With this function CellTrails automatically extracts the samples and their pseudotime along the trail by computing the shortest path between the trail start and end. Let’s make sure that the intended trails were extracted:
plot(ctset, type="trail", trail_name="TrS")
plot(ctset, type="trail", trail_name="TrES")
These plots show the trajectory map and highlight the defined trails along with its corresponding pseudotime.
Please note, that trails can be renamed with trailNames<- and removed with removeTrail, respectively.
It might be needed to define subtrails if trails overlap. This is neccessary if the dynamics of one trail are subdynamics of another trail. Because pseudotime mirrors the location of each datapoint in the latent space, a significant gap in pseudotime could indicate separate cell populations. However, these populations have only subtle gene expression profile differences and were linearly aligned in the latent space. Since pseudotime can also be interpreted as a function of transcriptional change, one can argue that these populations undergo the same expression program (for the selected features), with the small but distinct difference that samples ordered at the terminal end of the longer trail up- or down-regulate additional features late during their maturation. Thus, trails can overlap, while one trail is a subtrail of the longer trail.
The U landmarks that are needed to define a subtrail can be determined by the user, as demonstrated in the following.
First, we want to give a rational for selecting a specific node. As described in the original CellTrails article, we found a gap in pseudotime near the terminal end of trail TrES, which might indicate that the terminal state can be split, and two trails are actually overlapping. This gap becomes already quite obvious when we utilize yEd to for a closer look into the trajectory graph. First, we export the graph and colorize the nodes by state, and label the nodes by name.
# Trajectory Graph Layout: Export
write.ygraphml(ctset,
file='your/file/path/yourFileName.graphml',
pheno_type="STATE",
label="NAME")
Then we open the graphml file in yEd. The gap in the purple colored population is obvious:
To indicate this sample as landmark, we simply change the shape of this node. This can be any shape, but not ellipse, which is used for other nodes. The shape can be changed using the properties view panel on the right border of the yEd application.
After saving the layout, it can be reimported to CellTrails and the landmark can be used to define the subtrail:
# Trajectory Graph Layout: Import
ctset <- read.ygraphml(ctset,
file='your/file/path/yourFileName.graphml',
adjust=FALSE)
# Trail Identification
plot(ctset, type="trailblazing")
ctset <- addTrail(ctset, from="B3", to="U1", name="TrES*")
plot(ctset, type="trail", trail_name="TrES*")
A visual and empiric identification of U landmarks can be helpful, but scientifically more appropriate is a statistical approach. For this purpose we analyze the distribution of all lagged differences along trail TrES. Here, we make use of the function trailPseudotime, which enables to extract the pseudotime information from each trail, and the function trailStates which returns the state information, respectively.
# Extract required information
tPtime <- trailPseudotime(ctset, trail_name="TrES")
tStates <- trailStates(ctset, trail_name="TrES")
# Lagged pseudotime values per state
tPtime_tStates <- split(tPtime, tStates)
tPtime_tStates_diff <- lapply(tPtime_tStates,
function(x){y <- diff(x); y[-length(y)]})
bp <- boxplot(tPtime_tStates_diff, horizontal=TRUE,
ylab="State", xlab="Pseudotime delta", las=2)
The boxplot statistics indicate that there is a strong outlier in state S9 (which is termed state i in the original CellTrails article). We can use the function trailSamples to extract the two samples which are separated by this gap.
# All trail samples
tSamples <- trailSamples(ctset, trail_name="TrES")
# Trail samples of state S9 and lagged pseudotime
tSamples <- tSamples[tStates == "S9"]
tPtime_diff <- tPtime_tStates_diff$S9
# Sample names
tSamples[which.max(tPtime_diff)]
## [1] "Cell-8-57"
tSamples[which.max(tPtime_diff) + 1]
## [1] "Cell-4-23"
Either we indicate this landmark in yEd as described above. Samples can searched in the trajectory graph by Edit > Find > Nodes.
Or we use the function addLandmark directly in our R session.
ctset <- addLandmark(ctset, sample_name="Cell-8-57")
# Trail Identification
plot(ctset, type="trailblazing")
ctset <- addTrail(ctset, from="B3", to="U1", name="TrES*")
plot(ctset, type="trail", trail_name="TrES*")
Please note, that the trajectory graph can also be exported having all landmarks highlighted. This is particulary helpful if user-defined landmarks need to be changed.
# Export Trajectory Graph Layout with sample names
write.ygraphml(ctset,
file='your/file/path/yourFileName.graphml',
pheno_type="LANDMARK",
label="NAME")
# Export Trajectory Graph Layout with landmark ids
write.ygraphml(ctset,
file='your/file/path/yourFileName.graphml',
pheno_type="LANDMARK")
CellTrails defines pseudotime as the geodesic distance of each node of the trail from the start node; it is scaled between 0 and 100% for each trail. To learn the expression level of a feature as a function of pseudotime, CellTrails used generalized additive models (GAM) with a single smoothing term with five basis dimensions. Here, for each feature, CellTrails introduces prior weights for each observation to lower the confounding effect of missing data to the maximum-likelihood-based fitting process.
Feature expression as a function of pseudotime along an individual trail can be plotted with the plot function with parameter type="dynamic". This results in the fitted dynamic function (= black line) and the individual expression per sample (= points represent samples colored by their state membership). For example, the expression of the calcium buffer CALB2 during extrastriolar hair cell development can be displayed as follows:
# Required steps upstream
# ctset <- as.CellTrailsSet(gga_e15_utricle) # Create container
# ctset <- embedSamples(ctset) # Embed samples
# ctset <- reduceDimensions(ctset, findSpectrum(ctset)) # Reduce dimensionality
# ctset <- findStates(ctset) # Find states
# ctset <- connectStates(ctset) # Connect states
# ctset <- selectTrajectory(ctset, component=1) # Select trajectory
# ctset <- fitTrajectory(ctset) # Align samples to trajectory
# trajectoryLayout(ctset, adjust=TRUE) <- trajLayouts$gga_e15_utricle # Layout
# ctset <- addTrail(ctset, from="B3", to="H1", name="TrES") # Define trail TrES
plot(ctset, type="dynamic", feature_name="CALB2", trail_name="TrES")
The fitting information can be extracted via function fitDynamic and used for further downstream analyses:
fit <- fitDynamic(ctset, trail_name="TrES", feature_name="CALB2")
summary(fit)
## Length Class Mode
## pseudotime 470 -none- numeric
## expression 470 -none- numeric
## gam 52 gam list
range(fit$pseudotime)
## [1] 0 1
range(fit$expression)
## [1] 7.131096 15.419633
# Predict expression at 0%, 25%, 50%, 75% and 100% of pseudotime
predict(fit$gam, newdata=data.frame(x=c(0, .25, .5, .75, 1)))
## 1 2 3 4 5
## 7.131096 11.917651 14.649179 15.117753 10.831009
CellTrails allows the analysis and comparison of the expression of multiple features along a single trail. For example, the expression dynamics of the acting crosslinkers FSCN1 and FSCN2 can be displayed in a single plot as follows:
# Required steps upstream
# ctset <- as.CellTrailsSet(gga_e15_utricle) # Create container
# ctset <- embedSamples(ctset) # Embed samples
# ctset <- reduceDimensions(ctset, findSpectrum(ctset)) # Reduce dimensionality
# ctset <- findStates(ctset) # Find states
# ctset <- connectStates(ctset) # Connect states
# ctset <- selectTrajectory(ctset, component=1) # Select trajectory
# ctset <- fitTrajectory(ctset) # Align samples to trajectory
# trajectoryLayout(ctset, adjust=TRUE) <- trajLayouts$gga_e15_utricle # Layout
# ctset <- addTrail(ctset, from="B3", to="H1", name="TrES") # Define trail TrES
plot(ctset, type="dynamic",
feature_name=c("FSCN1", "FSCN2"),
trail_name="TrES")
By using the fitting function fitDynamic, the similarity/correspondence between curves can be quantified. This allows a quantitative assessment of the observed anticorrelation seen in the plot above between FSCN1 and FSCN2:
fscn1_fit <- fitDynamic(ctset, trail_name="TrES", feature_name="FSCN1")
fscn2_fit <- fitDynamic(ctset, trail_name="TrES", feature_name="FSCN2")
# Correlation
cor(fscn1_fit$expression, fscn2_fit$expression)
## [1] -0.6432156
Genes have non-uniform expression rates and each trail has a distinct set of upregulated features, but also contains unequal numbers of samples. Since pseudotime is computed based on expression differences between individual samples, the pseudotime axis may be distorted, leading to stretched or compressed sections of longitudinal expression data that make comparisons of such trails challenging. To align different trails, despite these differences, CellTrails employs a strategy that has long been known in speech recognition, called dynamic time warping (Sakoe and Chiba 1978). Feature expression dynamics are modeled analogous to how dynamic time warping is used to align phonetic dynamics present in speech. Innate non-linear variation in the length of individual phonemes (i.e. states) is appropriately modeled, which results in stretching and shrinking of word (i.e. trail) segments. This allows the computation of intertrail alignment warps of individual expression time series that are similar but locally out of phase. The overall dissimilarity between two expression time series can be estimated by the root-mean-square deviation (RMSD), the total deviation (TD), the area between curves (ABC), or Pearson’s correlation coefficient (COR) over all aligned elements. The warp and the corresponding quantitative score can be computed using the function contrastTrailExpr.
# Required steps upstream
# ctset <- as.CellTrailsSet(gga_e15_utricle) # Create container
# ctset <- embedSamples(ctset) # Embed samples
# ctset <- reduceDimensions(ctset, findSpectrum(ctset)) # Reduce dimensionality
# ctset <- findStates(ctset) # Find states
# ctset <- connectStates(ctset) # Connect states
# ctset <- selectTrajectory(ctset, component=1) # Select trajectory
# ctset <- fitTrajectory(ctset) # Align samples to trajectory
# trajectoryLayout(ctset, adjust=TRUE) <- trajLayouts$gga_e15_utricle # Layout
# ctset <- addTrail(ctset, from="B3", to="H9", name="TrS") # Define trail TrS
# ctset <- addTrail(ctset, from="B3", to="H1", name="TrES") # Define trail TrES
# Compare ATOH1 dynamic
# Root-mean-square deviation
contrastTrailExpr(ctset, feature_name="ATOH1",
trail_names=c("TrS", "TrES"), score="RMSD")
## ATOH1
## 0.0321191
# Total deviation
contrastTrailExpr(ctset, feature_name="ATOH1",
trail_names=c("TrS", "TrES"), score="TD")
## ATOH1
## 6.364569
# Area between curves
contrastTrailExpr(ctset, feature_name="ATOH1",
trail_names=c("TrS", "TrES"), score="ABC")
## ATOH1
## 0.02300158
# Pearson's correlation coefficient
contrastTrailExpr(ctset, feature_name="ATOH1",
trail_names=c("TrS", "TrES"), score="COR")
## ATOH1
## 0.9999033
To identify features that differ between two trails, we can compute the divergence for all features and analyze the Z-score distribution as derived by scale:
# Compare TrS and TrES dynamics
# Root-mean-square deviation
all_rmsd <- contrastTrailExpr(ctset,
trail_names=c("TrS", "TrES"), score="RMSD")
# Identify highly differing features
all_rmsd <- all_rmsd[all_rmsd > 0]
zscores <- scale(log(all_rmsd))
sort(all_rmsd[zscores > 1.65])
## SYN3 AKAP5 OCM SLC8A1 ATP2B2 MYO1H CAB39L MCOLN3
## 1.494186 1.529848 1.601327 1.903308 1.953126 1.981341 2.304602 2.417522
## CHRNA10 MYO3A CIB2 TNNC2 SKOR2 LOXHD1 TMC2
## 2.637483 2.671318 2.790144 3.051271 3.717096 4.129763 4.799791
In the case one wants to compare a large number of features (e.g. from an RNA-Seq experiment), the computation can be significantly sped up by parallel computing. In this example, we use the package doSNOW, but any other package may also be used for this purpose.
library(doSNOW)
# Register parallel backend
cpu.cl <- makeCluster(parallel::detectCores() * 2)
registerDoSNOW(cpu.cl)
# Compute warps
fnames <- featureNames(ctset)
all_rmsd <- foreach(i=seq_along(fnames), .combine=rbind) %dopar% {
g <- fnames[i]
CellTrails::contrastTrailExpr(ctset,
feature_name=g,
trail_names=c("TrES", "TrS"),
score="RMSD")
}
stopCluster(cpu.cl)
all_rmsd <- all_rmsd[, 1]
names(all_rmsd) <- fnames
Please note, that the advantage in computation time increases with the number of features; for a small number of features the parallel computing approach may be slower than simply calling contrastTrailExpr with feature_name = featureNames(ctset) due to its overhead.
The following protocols describe how this package can be used to perform the data analysis shown in the original CellTrails article.
# Load expression data
data("gga_e15_utricle")
# Create Container
ctset <- as.CellTrailsSet(gga_e15_utricle)
# Manifold Learning
ctset <- embedSamples(ctset)
ctspec <- findSpectrum(ctset)
ctset <- reduceDimensions(ctset, ctspec)
# Clustering
ctset <- findStates(ctset)
# Sample Ordering
ctset <- connectStates(ctset)
ctset <- selectTrajectory(ctset, component=1)
ctset <- fitTrajectory(ctset)
# CellTrails maps
# Please note: For illustration purposes, the layout was
# computed in yEd using functions write.ygraphml and
# read.ygraphml and stored in object trajLayouts
trajectoryLayout(ctset, adjust=TRUE) <- trajLayouts$gga_e15_utricle
# Define subtrail by adding a user-defined landmark
ctset <- addLandmark(ctset, sample_name="Cell-8-57")
# Analysis of Expression Dynamics
ctset <- addTrail(ctset, from="B3", to="H9", name="TrS")
ctset <- addTrail(ctset, from="B3", to="H1", name="TrES")
ctset <- addTrail(ctset, from="B3", to="U1", name="TrES*")
# Inter-trail comparison (similar to Figure 5B)
rmsd_all <- contrastTrailExpr(ctset, trail_names=c("TrS", "TrES"))
rmsd_all <- rmsd_all[rmsd_all > 0]
sort(rmsd_all[scale(log(rmsd_all)) > 1.65])
# -------------------------------
# Visualizations
# -------------------------------
# Plot spectrum
plot(ctspec)
# Plot size of clusters (similar to Figure 2E)
plot(ctset, type="stateSize")
# Plot expression distribution
plot(ctset, type="stateExpression", feature_name="TECTA")
# Plot latent space (similar to Figure S4F)
ggp <- plot(ctset, type="latentSpace", pheno_type="state"); ggp
plot(ctset, type="latentSpace", pheno_type="FM143", viz=ggp)
plot(ctset, type="latentSpace", feature_name="TECTA", viz=ggp)
# Plot state trajectory graph (similar to Figure 1G)
plot(ctset, type="stateTrajectoryGraph", component=1, pheno_type="FM143",
point_size=1.5, label_offset=4)
plot(ctset, type="stateTrajectoryGraph", component=1, pheno_type="ORIGIN",
point_size=1.5, label_offset=4)
plot(ctset, type="stateTrajectoryGraph", component=1, feature_name="CALB2")
# Plot trajectory fit (similar to Figure 2E)
plot(ctset, type="trajectoryFit", rev=TRUE)
# Plot CellTrails maps (similar to Figure 3 and Table S2)
plot(ctset, type="map", pheno_type="ORIGIN")
plot(ctset, type="map", pheno_type="FM143")
plot(ctset, type="map", feature_name="LOXHD1")
plot(ctset, type="map", feature_name="TMC2")
plot(ctset, type="map", feature_name="ATP2B2")
plot(ctset, type="map", feature_name="TNNC2")
plot(ctset, type="map", feature_name="SKOR2")
plot(ctset, type="map", feature_name="CALB2")
plot(ctset, type="map", feature_name="OTOA")
plot(ctset, type="map", feature_name="TECTA")
plot(ctset, type="map", feature_name="TECTB")
plot(ctset, type="map", feature_name="ATOH1")
plot(ctset, type="map", feature_name="KLHDC7A")
plot(ctset, type="map", feature_name="POU4F3")
# Plot trailblazing
plot(ctset, type="trailblazing")
# Plot trails (similar to Figure 4K)
plot(ctset, type="trail", trail_name="TrS")
plot(ctset, type="trail", trail_name="TrES")
plot(ctset, type="trail", trail_name="TrES*")
# Plot single dynamics (similar to Figure 4B,G and Table S2)
plot(ctset, type="dynamic", feature_name="CALB2", trail_name="TrES")
plot(ctset, type="dynamic", feature_name="ATP2B2", trail_name="TrES")
# Compare dynamics (similar to Figure 6A)
plot(ctset, type="dynamic", feature_name=c("TECTA", "OTOA", "ATOH1", "POU4F3",
"MYO7A", "CALB2", "SYN3", "SKOR2",
"ATP2B2", "LOXHD1", "MYO3A", "TMC2",
"TNNC2"), trail_name="TrES")
plot(ctset, type="dynamic", feature_name=c("TECTA", "OTOA", "ATOH1", "POU4F3",
"MYO7A", "CALB2", "SYN3", "SKOR2",
"ATP2B2", "LOXHD1", "MYO3A", "TMC2",
"TNNC2"), trail_name="TrS")
In the following, we provide a protocol to analyze the publicly-available dataset containing single-cell RNASeq measurements of 14,313 genes in 120 cells from P1 mouse utricles (GEO accession code: GSE71982). Experimental metadata was generated during cell sorting (GFP and tdTomato fluorescence indicating major cell types). The processed data (mmu_p1_utricle.rda) can be downloaded as Biobase::ExpressionSet from here.
If you use this dataset for your research, please cite:
Burns JC, Kelly MC, Hoa M, Morell RJ, Kelley MW. “Single-cell RNA-Seq resolves cellular complexity in sensory organs from the neonatal inner ear”. Nat Commun. 2015 Oct 15;6:8557. doi: 10.1038/ncomms9557.
# Load expression data
load("mmu_p1_utricle.rda")
# Create Container
ctset <- as.CellTrailsSet(mmu_p1_utricle)
# Feature Selection
ctset <- filterFeaturesByDL(ctset, threshold=2)
ctset <- filterFeaturesByCOV(ctset, threshold=.5)
ctset <- filterFeaturesByFF(ctset)
# Manifold Learning
ctset <- embedSamples(ctset)
ctspec <- findSpectrum(ctset)
ctset <- reduceDimensions(ctset, ctspec)
# Clustering (parameters account for low sample size)
ctset <- findStates(ctset, max_pval=1e-3, min_feat=2)
# Sample Ordering
ctset <- connectStates(ctset)
ctset <- fitTrajectory(ctset)
# CellTrails maps
# Please note: For illustration purposes, the layout was
# computed in yEd using functions write.ygraphml and
# read.ygraphml and stored in object trajLayouts
trajectoryLayout(ctset, adjust=TRUE) <- trajLayouts$mmu_p1_utricle
# Analysis of Expression Dynamics
ctset <- addTrail(ctset, from="H1", to="H3", name="Tr1")
ctset <- addTrail(ctset, from="H1", to="H2", name="Tr2")
# Inter-trail comparison
rmsd_all <- contrastTrailExpr(ctset, trail_names=c("Tr1", "Tr2"))
rmsd_all <- rmsd_all[rmsd_all > 0]
sort(rmsd_all[scale(log(rmsd_all)) > 1.65])
# Alternative: using parallel computing using doSNOW
library(doSNOW)
cpu.cl <- makeCluster(parallel::detectCores() * 2)
registerDoSNOW(cpu.cl)
fnames <- featureNames(ctset)
all_rmsd <- foreach(i=seq_along(fnames), .combine=rbind) %dopar% {
g <- fnames[i]
CellTrails::contrastTrailExpr(ctset, feature_name=g,
trail_names=c("Tr1", "Tr2"), score="RMSD")
}
stopCluster(cpu.cl)
all_rmsd <- all_rmsd[, 1]
names(all_rmsd) <- fnames
all_rmsd <- all_rmsd[all_rmsd > 0]
zscores <- scale(log(all_rmsd))
sort(all_rmsd[zscores > 1.65])
# -------------------------------
# Visualizations
# -------------------------------
# Plot spectrum
plot(ctspec)
# Plot trajectory fit (similar to Figure 7A)
plot(ctset, type="trajectoryFit")
# Plot size of clusters (similar to Figure 7A)
plot(ctset, type="stateSize")
# Plot expression distribution
plot(ctset, type="stateExpression", feature_name="Tecta")
# Plot latent space
ggp <- plot(ctset, type="latentSpace", pheno_type="gate"); ggp
plot(ctset, type="latentSpace", feature_name="Myo7a", viz=ggp)
# Plot state trajectory graph
plot(ctset, type="stateTrajectoryGraph", component=1, pheno_type="gate",
point_size=2, label_offset=5)
plot(ctset, type="stateTrajectoryGraph", component=1, feature_name="Sox2")
# Plot CellTrails maps (similar to Figure 7C-E)
plot(ctset, type="map", pheno_type="gate")
plot(ctset, type="map", feature_name="Tecta")
plot(ctset, type="map", feature_name="Atoh1")
plot(ctset, type="map", feature_name="Fscn2")
plot(ctset, type="map", feature_name="Ocm")
plot(ctset, type="map", feature_name="Sox2")
# Plot trailblazing
plot(ctset, type="trailblazing")
# Plot trails (similar to Figure 7F)
plot(ctset, type="trail", trail_name="Tr1")
plot(ctset, type="trail", trail_name="Tr2")
# Plot single dynamics (similar to Figure 7I)
plot(ctset, type="dynamic", feature_name="Fgf21", trail_name="Tr2")
plot(ctset, type="dynamic", feature_name="Fgf21", trail_name="Tr1")
plot(ctset, type="dynamic", feature_name="AI593442", trail_name="Tr2")
plot(ctset, type="dynamic", feature_name="AI593442", trail_name="Tr1")
# Compare dynamics
plot(ctset, type="dynamic", feature_name=c("Fscn1", "Fscn2"), trail_name="Tr2")
In this section, we illustrate that a CellTrails analysis can be performed in a reasonable period of time. The elapsed computational runtime of each function was measured on a MacBook Pro (Early 2015) with a 3.1 GHz Intel Core i7 processor, 16 GB 1867 MHz DDR3 RAM, and an Intel Iris Graphics 6100 1536 MB graphics card.
This dataset consists of 183 features and 1,008 samples.
| Function | Elapsed time (seconds) |
|---|---|
| Total Computation | 66.89 |
CellTrailsSet |
0.12 |
embedSamples |
35.77 |
findSpectrum |
0.01 |
reduceDimensions |
0.00 |
findStates |
9.67 |
connectStates |
0.26 |
selectTrajectory |
0.00 |
fitTrajectory |
2.18 |
write.ygraphml |
1.68 |
read.ygraphml |
0.17 |
addTrail |
0.01 |
contrastTrailExpr |
17.02 |
| Total Visualization | 11.41 |
plot(ctspec) |
0.01 |
plot(ctset, type='stateSize') |
0.01 |
plot(ctset, type='stateExpression') |
0.01 |
plot(ctset, type='latentSpace') |
7.59 |
plot(ctset, type='latentSpace', viz) |
0.01 |
plot(ctset, type='stateTrajectoryGraph', pheno_type) |
1.02 |
plot(ctset, type='stateTrajectoryGraph', feature_name) |
0.08 |
plot(ctset, type='trajectoryFit') |
0.51 |
plot(ctset, type='map', pheno_type) |
0.02 |
plot(ctset, type='map', feature_name) |
2.08 |
plot(ctset, type='trailblazing') |
0.02 |
plot(ctset, type='trail') |
0.02 |
plot(ctset, type='dynamic') |
0.03 |
Total time: about one to two minutes. Let’s assume that computing the layout takes about two minutes (starting yEd, running the layouter, saving the file), then the total runtime is up to five minutes.
This dataset consists of 18,827 features and 123 samples.
| Function | Elapsed time (seconds) |
|---|---|
| Total Computation | 620.759 |
CellTrailsSet |
0.11 |
filterFeaturesByDL |
1.36 |
filterFeaturesByCOV |
2.02 |
filterFeaturesByFF |
1.50 |
embedSamples |
1.19 |
findSpectrum |
0.00 |
reduceDimensions |
0.00 |
findStates |
6.77 |
connectStates |
0.01 |
fitTrajectory |
0.26 |
write.ygraphml |
0.20 |
read.ygraphml |
0.17 |
addTrail |
0.01 |
contrastTrailExpr* |
607.159 |
| Total Visualization | 2.24 |
plot(ctspec) |
0.01 |
plot(ctset, type='stateSize') |
0.01 |
plot(ctset, type='stateExpression') |
0.01 |
plot(ctset, type='latentSpace') |
0.87 |
plot(ctset, type='latentSpace', viz) |
0.02 |
plot(ctset, type='stateTrajectoryGraph', pheno_type) |
0.43 |
plot(ctset, type='stateTrajectoryGraph', feature_name) |
0.08 |
plot(ctset, type='trajectoryFit') |
0.02 |
plot(ctset, type='map', pheno_type) |
0.01 |
plot(ctset, type='map', feature_name) |
0.71 |
plot(ctset, type='trailblazing') |
0.01 |
plot(ctset, type='trail') |
0.03 |
plot(ctset, type='dynamic') |
0.03 |
Total time: about 10 to 11 minutes. Let’s assume that computing the layout takes about two minutes (starting yEd, running the layouter, saving the file), then the total runtime is up to 15 minutes. Of note, the runtime of contrastTrailExpr can be reduced to about 4 minutes (241.95s) from about 10 minutes using parallel computing.
We used yEd version 3.14.4. In the following we provide the information about the R session and the system used to compile this document:
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] CellTrails_0.99.0 Biobase_2.38.0 BiocGenerics_0.24.0
## [4] BiocStyle_2.6.1
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.2 Hmisc_4.0-3
## [3] RcppEigen_0.3.3.3.1 plyr_1.8.4
## [5] igraph_1.1.2 lazyeval_0.2.1
## [7] sp_1.2-5 shinydashboard_0.6.1
## [9] splines_3.4.3 BiocParallel_1.12.0
## [11] GenomeInfoDb_1.14.0 ggplot2_2.2.1
## [13] scater_1.6.1 digest_0.6.12
## [15] htmltools_0.3.6 viridis_0.4.0
## [17] magrittr_1.5 checkmate_1.8.5
## [19] memoise_1.1.0 cluster_2.0.6
## [21] limma_3.34.3 matrixStats_0.52.2
## [23] xts_0.10-0 prettyunits_1.0.2
## [25] colorspace_1.3-2 blob_1.1.0
## [27] ggrepel_0.7.0 xfun_0.1
## [29] dplyr_0.7.4 RCurl_1.95-4.8
## [31] jsonlite_1.5 tximport_1.6.0
## [33] EnvStats_2.3.0 lme4_1.1-14
## [35] bindr_0.1 survival_2.41-3
## [37] zoo_1.8-0 glue_1.2.0
## [39] gtable_0.2.0 zlibbioc_1.24.0
## [41] XVector_0.18.0 MatrixModels_0.4-1
## [43] DelayedArray_0.4.1 cba_0.2-19
## [45] car_2.1-6 kernlab_0.9-25
## [47] SingleCellExperiment_1.0.0 prabclus_2.2-6
## [49] DEoptimR_1.0-8 SparseM_1.77
## [51] VIM_4.7.0 abind_1.4-5
## [53] scales_0.5.0 mvtnorm_1.0-6
## [55] DBI_0.7 edgeR_3.20.1
## [57] Rcpp_0.12.14 dtw_1.18-1
## [59] laeken_0.4.6 viridisLite_0.2.0
## [61] xtable_1.8-2 progress_1.1.2
## [63] htmlTable_1.11.0 foreign_0.8-69
## [65] bit_1.1-12 proxy_0.4-20
## [67] mclust_5.4 Formula_1.2-2
## [69] stats4_3.4.3 DT_0.2
## [71] vcd_1.4-4 htmlwidgets_0.9
## [73] FNN_1.1 RColorBrewer_1.1-2
## [75] fpc_2.1-10 acepack_1.4.1
## [77] modeltools_0.2-21 pkgconfig_2.0.1
## [79] XML_3.98-1.9 flexmix_2.3-14
## [81] nnet_7.3-12 locfit_1.5-9.1
## [83] dynamicTreeCut_1.63-1 labeling_0.3
## [85] rlang_0.1.4 reshape2_1.4.3
## [87] AnnotationDbi_1.40.0 munsell_0.4.3
## [89] tools_3.4.3 RSQLite_2.0
## [91] evaluate_0.10.1 stringr_1.2.0
## [93] maptree_1.4-7 yaml_2.1.16
## [95] knitr_1.17 bit64_0.9-7
## [97] robustbase_0.92-8 rgl_0.98.22
## [99] purrr_0.2.4 dendextend_1.6.0
## [101] bindrcpp_0.2 nlme_3.1-131
## [103] quantreg_5.34 whisker_0.3-2
## [105] mime_0.5 scran_1.6.6
## [107] biomaRt_2.34.0 pbkrtest_0.4-7
## [109] compiler_3.4.3 rstudioapi_0.7
## [111] curl_3.1 beeswarm_0.2.3
## [113] e1071_1.6-8 smoother_1.1
## [115] tibble_1.3.4 statmod_1.4.30
## [117] stringi_1.1.6 highr_0.6
## [119] lattice_0.20-35 trimcluster_0.1-2
## [121] circular_0.4-93 Matrix_1.2-12
## [123] nloptr_1.0.4 lmtest_0.9-35
## [125] depth_2.1-1 data.table_1.10.4-3
## [127] bitops_1.0-6 httpuv_1.3.5
## [129] GenomicRanges_1.30.0 R6_2.2.2
## [131] latticeExtra_0.6-28 bookdown_0.7
## [133] gridExtra_2.3 vipor_0.4.5
## [135] IRanges_2.12.0 boot_1.3-20
## [137] MASS_7.3-47 assertthat_0.2.0
## [139] destiny_2.6.1 rhdf5_2.22.0
## [141] SummarizedExperiment_1.8.0 rprojroot_1.2
## [143] rjson_0.2.15 S4Vectors_0.16.0
## [145] GenomeInfoDbData_0.99.1 diptest_0.75-7
## [147] mgcv_1.8-22 grid_3.4.3
## [149] rpart_4.1-11 minqa_1.2.4
## [151] tidyr_0.7.2 class_7.3-14
## [153] rmarkdown_1.8 Rtsne_0.13
## [155] TTR_0.23-2 shiny_1.0.5
## [157] base64enc_0.1-3 ggbeeswarm_0.6.0
Angerer, P., L. Haghverdi, M. Buettner, FJ. Theis, C. Marr, and F. Buettner. 2015. “Destiny: Diffusion Maps for Large-Scale Single-Cell Data in R.” Bioinformatics 32: 1241–3.
Bedall, F.K., and H. Zimmermann. 1979. “Algorithm As143. the Mediancentre.” Appl Statist 28: 325–28.
Belkin, M., and P. Niyogi. 2003. “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.” Neural Computation 15: 1373–96.
Benjamini, Y., and Y. Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society Series B 57: 289–300.
Buettner, F., KN. Natarajan, FP. Casale, V. Proserpio, A. Scialdone, FJ. Theis, SA. Teichmann, JC. Marioni, and O. Stegle. 2015. “Computational Analysis of Cell-to-Cell Heterogeneity in Single-Cell RNA-Sequencing Data Reveals Hidden Subpopulations of Cells.” Nature Biotechnology 33(2): 155–60.
Carlson, M. 2017. “Org.Mm.eg.db: Genome Wide Annotation for Mouse.” R Package Version 3.5.0.
Falcon, S., M. Morgan, and R. Gentleman. 2007. “An Introduction to Bioconductor’s ExpressionSet Class.” http://bioconductor.org/packages/release/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf.
Huber, W., V.J. Carey, R. Gentleman, S. Anders, M. Carlson, B.S. Carvalho, H.C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12: 115–21.
Kruskal, J.B. 1956. “On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem.” Proc Amer Math Soc 7: 48–50.
Lun, ATL., DJ. McCarthy, and JC. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Res. 5: 2122.
Mahata, B., X. Zhang, AA. Kolodziejczyk, V. Proserpio, L. Haim-Vilmovsky, AE. Taylor, D. Hebenstreit, et al. 2014. “Single-Cell RNA Sequencing Reveals T Helper Cells Synthesizing Steroids de Novo to Contribute to Immune Homeostasis.” Cell Reports 7(4): 1130–42.
Pages, H., M. Carlson, S. Falcon, and N. Li. 2017. “AnnotationDbi: Annotation Database Interface.” R Package Version 1.40.0.
Peto, R., and J. Peto. 1972. “Asymptotically Efficient Bank Invariant Test Procedures (with Discussion).” Journal of the Royal Statistical Society Series A 135: 185–206.
Sakoe, H., and S. Chiba. 1978. “Dynamic Programming Algorithm Optimization for Spoken Word Recognition.” IEEE Transactions on Acoustics, Speech, and Signaling Processing 26: 43–49.
Sussman, D.L., M. Tang, D.E. Fishkind, and C.E. Priebe. 2012. “A Consistent Adjacency Spectral Embedding for Stochastic Blockmodel Graphs.” J Am Stat Assoc 107: 1119–28.
van der Maaten, L.J.P., and G.E. Hinton. 2008. “Visualizing High-Dimensional Data Using T-SNE.” Journal of Machine Learning Research 9: 2579–2605.
Ward, J.H. 1963. “Hierarchical Grouping to Optimize an Objective Function.” Journal of the American Statistical Association 58: 236–44.
Wickham, H. 2009. Elegant Graphics for Data Analysis. Springer-Verlag New York.